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Abstract 

We present Dual-Loco, a communication- 
efficient algorithm for distributed statistical 
estimation. Dual-Loco assumes that the 
data is distributed across workers according 
to the features rather than the samples. It 
requires only a single round of communica¬ 
tion where low-dimensional random projec¬ 
tions are used to approximate the depen¬ 
dencies between features available to differ¬ 
ent workers. We show that Dual-Loco has 
bounded approximation error which only de¬ 
pends weakly on the number of workers. We 
compare Dual-Loco against a state-of-the- 
art distributed optimization method on a va¬ 
riety of real world datasets and show that it 
obtains better speedups while retaining good 
accuracy. In particular, Dual-Loco allows 
for fast cross validation as only part of the 
algorithm depends on the regularization pa¬ 
rameter. 


1 Introduction 

Many statistical estimation tasks amount to solving an 
optimization problem of the form 

min lJ(/3) := + ^\\/3\\l (1) 

i—1 

where A > 0 is the regularization parameter. The 
loss functions /,(/3 T x,) depend on labels yt € R. and 
linearly on the coefficients, (3 through a vector of co¬ 
variates, Xj £ R p . Furthermore, we assume all _/) to 
be convex and smooth with Lipschitz continuous gra¬ 
dients. Concretely, when /j(/3 T Xj) = (y t — /3 T Xi) 2 , 
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Eq. ([I]) corresponds to ridge regression; for logistic re¬ 
gression fi((3 T Xi) = log (1 + exp (-2/i/3 T Xj)). 

For large-scale problems, it is no longer practical to 
solve even relatively simple estimation tasks such as 
([I]) on a single machine. To deal with this, approaches 
to distributed data analysis have been proposed that 
take advantage of many cores or computing nodes on 
a cluster. A common idea which links many of these 
methods is stochastic optimization. Typically, each of 
the workers only sees a small portion of the data points 
and performs incremental updates to a global parame¬ 
ter vector. It is typically assumed that the number of 
data points, n, is very large compared with the number 
of features, p, or that the data is extremely sparse. In 
such settings - which are common, but not ubiquitous 
in large datasets - distributed stochastic optimization 
algorithms perform well but may converge slowly oth¬ 
erwise. 

A fundamentally different approach to distributing 
learning is for each worker to only have access to a 
portion of the available features. Distributing accord¬ 
ing to the features could be a preferable alternative 
for several reasons. Firstly, for high-dimensional 
data, where p is large relative to n, better scaling 
can be achieved. This setting is challenging, how¬ 
ever, since most loss functions are not separable across 
coordinates. High-dimensional data is commonly en¬ 
countered in the fields of bioinformatics, climate sci¬ 
ence and computer vision. Furthermore, for a vari¬ 
ety of prediction tasks it is often beneficial to map 
input vectors into a higher dimensional feature space, 
e.g. using deep representation learning or considering 
higher-order interactions. Secondly, privacy. Indi¬ 
vidual blocks of features could correspond to sensitive 
information (such as medical records) which should be 
included in the predictive model but is not allowed to 
be communicated in an un-disguised form. 

Our contribution. In this work we introduce 
Dual-Loco to solve problems of the form (jT]) in the 
distributed setting when each worker only has access 
to a subset of the features. Dual-Loco is an exten- 
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sion of the LOCO algorithm [T] which was recently pro¬ 
posed for solving distributed ridge regression in this 
setting. We propose an alternative formulation where 
each worker instead locally solves a dual optimization 
problem. Dual-Loco has a number of practical and 
theoretical improvements over the original algorithm: 

• Dual-Loco is applicable to a wider variety of 
smooth, convex £2 penalized loss minimization 
problems encompassing many widely used regres¬ 
sion and classification loss functions, including 
ridge regression, logistic regression and others. 

• In (|4]we provide a more intuitive and tighter the¬ 
oretical result which crucially does not depend on 
specific details of the ridge regression model and 
has weaker dependence on the number of workers, 
K. 

• We also show that adding (rather than concate¬ 
nating) random features allows for an efficient 
implementation yet retains good approximation 
guarantees. 

In f|5] we report experimental results with high¬ 
dimensional real world datasets corresponding to 
two different problem domains: climate science and 
computer vision. We compare Dual-Loco with 
CoCoA + , a recently proposed state-of-the-art algo¬ 
rithm for distributed dual coordinate ascent [2j. Our 
experiments show that Dual-Loco demonstrates bet¬ 
ter scaling with K than CoCoA + while retaining 
a good approximation of the optimal solution. We 
provide an implementation of Dual-Loco in Apache 
SparlQ The portability of this framework ensures that 
Dual-Loco is able to be run in a variety of distributed 
computing environments. 

2 Related work 

2.1 Distributed Estimation 

Recently, several asynchronous stochastic gradient de¬ 
scent (SGD) methods [31 jj have been proposed for 
solving problems of the form ([I]) in a parallel fash¬ 
ion in a multi-core, shared-memory environment and 
have been extended to the distributed setting. For 
such methods, large speedups are possible with asyn¬ 
chronous updates when the data is sparse. However, in 
some problem domains the data collected is dense with 
many correlated features. Furthermore, the p^> n set¬ 
ting can result in slow convergence. In the distributed 
setting, such methods can be impractical since the cost 
of communicating updates can dominate other compu¬ 
tational considerations. 

: http://spark.apache.org/ 


Jaggi et al. proposed a communication-efficient dis¬ 
tributed dual coordinate ascent algorithm (C 0 C 0 A 
resp. CoCoA + ) [HE]. Each worker makes multiple 
updates to its local dual variables before communi¬ 
cating the corresponding primal update. This allows 
for trading off communication and convergence speed. 
Notably they show that convergence is actually inde¬ 
pendent of the number of workers, thus CoCoA + ex¬ 
hibits strong scaling with K. 

Other recent work considers solving statistical estima¬ 
tion tasks using a single round of communication lam. 
However, all of these methods consider only distribut¬ 
ing over the rows of the data where an i.i.d. assump¬ 
tion on the observations holds. 

On the other hand, few approaches have considered 
distributing across the columns (features) of the data. 
This is a more challenging task for both estimation and 
optimization since the columns are typically assumed 
to have arbitrary dependencies and most commonly 
used loss functions are not separable over the features. 
Recently, Loco was proposed to solve ridge regression 
when the data is distributed across the features [I]. 
Loco requires a single round to communicate small 
matrices of randomly projected features which approx¬ 
imate the dependencies in the rest of the dataset (cf. 
Figure [Tj) . Each worker then optimizes its own sub¬ 
problem independently and finally sends its portion 
of the solution vector back to the master where they 
are combined. LOCO makes no assumptions about the 
correlation structure between features. It is therefore 
able to perform well in challenging settings where the 
features are correlated between blocks and is partic¬ 
ularly suited when p n. Indeed, since the relative 
dimensionality of local problems decreases when split¬ 
ting by columns, they are easier in a statistical sense. 
Loco makes no assumptions about data sparsity so it 
is also able to obtain speedups when the data is dense. 

One-shot communication schemes are beneficial as the 
cost of communication consists of a fixed cost and a 
cost that is proportional to the size of the message. 
Therefore, it is generally cheaper to communicate a 
few large objects than many small objects. 

2.2 Random projections for estimation and 
optimization 

Random projections are low-dimensional embeddings 
LE : R T —> which approximately preserve an 

entire subspace of vectors. They have been extensively 
used to construct efficient algorithms when the sample- 
size is large in a variety of domains such as: nearest 
neighbours [Sj, matrix factorization {§], least squares 
EMEU and recently in the context of optimization 

m • 
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Original design 
matrix 


Original design matrix, 
distributed across 
K workers 
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matrices 


Figure 1: Schematic for the distributed approximation of 
a large data set with random projections, used by Dual- 
LOCO. 


primal solution has the form (3*(a*) = -^X T a*. 

Clearly, a similar dual problem to 0 can be defined 
in the projected space. Defining K = (Xn)(XII) T we 
have 

n i 

max - f*( a i ) - (3) 

a£l" 2n\ 

l—l 

Importantly, the vector of dual variables does not 
change dimension depending on whether the origi¬ 
nal problem 0 or the projected problem ([3]) is be¬ 
ing solved. Under mild assumptions on the loss func¬ 
tion, by mapping the solution to this new problem, 
a, back to the original space one obtains a vector 
/3(d) = — ^jX T ct , which is a good approximation 
to f3 *, the solution to the original problem 0 (HUE]- 

3 The Dual-Loco algorithm 


We concentrate on the Subsampled Randomized 
Hadamard Transform (SRHT), a structured random 
projection jT3j. The SRHT consists of a projection ma¬ 
trix, FT = \Jt/t su f, s DHS with the definitions: (i) 
S £ i s a subsampling matrix, (ii) D £ R TXr 

is a diagonal matrix whose entries are drawn inde¬ 
pendently from {—1,1}. (iii) H £ M TXT is a nor¬ 
malized Walsh-Hadamard matrix. The key benefit of 
the SRHT is that due to its recursive definition the 
product between II T and u £ K T can be computed in 
O (r log t) time while never constructing II explicitly. 

For moderately sized problems, random projections 
have been used to reduce the dimensionality of the 
data prior to performing regression [H mg. How¬ 
ever after projection, the solution vector is in the 
compressed space and so interpretability of coeffi¬ 
cients is lost. Furthermore, the projection of the 
low-dimensional solution back to the original high¬ 
dimensional space is in fact guaranteed to be a bad 
approximation of the optimum nn. 

Dual Random Projections. Recently, [16| HZ! 
studied the effect of random projections on the dual 
optimization problem. For the primal problem in 
Eq. 0, defining K = XX T , we have the correspond¬ 
ing dual 

n 

f:{ai) ~^\ aTKa (2) 

i -1 

where f* is the conjugate Fenchel dual of / and 
A > 0. For example, for squared loss functions fi(u) = 
\(yi — u) 2 , we have f*(a) = \oi 1 + ayt. For prob¬ 
lems of this form, the dual variables can be directly 
mapped to the primal variables, such that for a vec¬ 
tor a* which attains the maximum of 0- the optimal 


Algorithm 1 Dual-Loco 
Input: Data: X, Y , no. workers: K 
Parameters: r su b s , A 


f: 


Partition {p} into K subsets of equal size r and 
distribute feature vectors in X accordingly over 
K workers. 

for each worker k £ {1, ... K} in parallel do 

Compute and send random features X^IIfc. 
Receive random features and construct X^. 
d k ■£- LocalDualSolver(Xfc, Y, A) 

@k = ^nX-^-k&k 

Send /3 k to driver. 

end for 


Output: Solution vector: (3 = f3 1 ,...,(3 K 


In this section we detail the Dual-Loco algorithm. 
Dual-Loco differs from the original LOCO algorithm 
in two important ways, (i) The random features from 
each worker are summed, rather than concatenated, 
to obtain a T su b s dimensional approximation allowing 
for an efficient implementation in a large-scale dis¬ 
tributed environment, (ii) Each worker solves a local 
dual problem similar to §. This allows us to extend 
the theoretical guarantees to a larger class of estima¬ 
tion problems beyond ridge regression (Q. 

We consider the case where p features are distributed 
across K different workers in non-overlapping subsets 
Vi,, Vk of equal siz0 r = p/K. 

Since most loss functions of interest are not separa¬ 
ble across coordinates, a key challenge addressed by 
Dual-Loco is to define a local minimization prob¬ 
lem for each worker to solve independently and asyn¬ 
chronously while still maintaining important depen- 

2 This is for simplicity of notation only, in general the 
partitions can be of different sizes. 
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dencies between features in different blocks and keep¬ 
ing communication overhead low. Algorithm [l] details 
Dual-Loco in full. 

We can rewrite 0 making explicit the contribution 
from block k. Letting X*, £ M raXT be the sub-matrix 
whose columns correspond to the coordinates in Vk 
(the “raw” features of block k) and X(_^) £ jgmx(p—r) 
be the remaining columns of X, we have 

n 

^ ' fi (^-i,kfi raw A — 
i—1 

+ Mll/^rawll! + ll^(-fc)lli)- ( 4 ) 

Where x^*, and x i( -_ fc ) are the rows of X*, and X(_ fc ) 
respectively. We replace X(_ fc j in each block with 
a low-dimensional randomized approximation which 
preserves its contribution to the loss function. This 
procedure is described in Figure [l] 

In Step 5, these matrices of random features are com¬ 
municated and worker k constructs the matrix 


X fc £ 




k'^k 


Xi.ru 


(5) 


which is the concatenation of worker fc’s raw features 
and the sum of the random features from all other 
workers. II is the SRHT matrix introduced in §2.2| 

As we prove in Lemma [2j summing R T —» 
dimensional random projections from (K — 1) blocks 
is equivalent to computing the —» R Ts " be '- 

dimensional random projection in one go. The latter 
operation is impractical for very large p and not ap¬ 
plicable when the features are distributed. Therefore, 
summing the random features from each worker allows 
the dimensionality reduction to be distributed across 
workers. Additionally, the summed random feature 
representation can be computed and combined very 
efficiently. We elaborate on this aspect in ([5] 


For a single worker the local, approximate primal prob¬ 
lem is then 

n , 

_ min Jk((3) := ^/i(/3 T Xi) + -\\(3\\l (6) 

flei T+T ™i.s 2 

1=1 

where x* £ is the i th row of X*,. The corre¬ 

sponding dual problem for each worker in the Dual- 
Loco algorithm is 


1 

max - E - ^« T K,a, K k = X k Xj. 

aGt n z —' In A 

i= 1 

(7) 

The following steps in Algorithm [I] detail respectively 
how the solution to ([T]) and the final Dual-Loco es¬ 
timates are obtained. 


Step 6. LocalDualSolver. The LocalDualSolver 
computes the solution for ([7|, the local dual problem. 
The solver can be chosen to best suit the problem at 
hand. This will depend on the absolute size of n and 
t + T su b s as well as on their ratio. For example, we 
could use SDCA [18] or Algorithm 1 from [151 . 


Step 7. Obtaining the global primal solution. 

Each worker maps its local dual solution to the pri¬ 
mal solution corresponding only to the coordinates in 
Vk ■ In this way, each worker returns coefficients corre¬ 
sponding only to its own raw features. The final primal 
solution vector is obtained by concatenating the K lo¬ 
cal solutions. Unlike LOCO, we no longer require to 
discard the coefficients corresponding to the random 
features for each worker. Consequently, computing es¬ 
timates is more efficient (especially when n). 

4 DUAL-LOCO Approximation Error 


In this section we bound the recovery error between 
the Dual-Loco solution and the solution to Eq. 0. 

Theorem 1 (Dual-Loco error bound). Consider a 
matrix X £ M" xp with rank, r. Assume that the loss 
/(•) is smooth and Lipschitz continuous. For a sub¬ 
sampling dimension T su b s > c\pK where 0 < c\ < 
1/K 2 , let (3* be the solution to ([Tj) and /3 be the esti¬ 
mate returned by Algorithm [IJ We have with probabil¬ 
ity at least 1 — K (5 + fe?) 


n/ 3 -r 


< 


where 


1 — E 


||/3* 


( 8 ) 


£ = 


I Co log(2r/<5)r 

Cip 


< 1. 


Proof. By Lemma [4] and applying a union bound we 
can decompose the global optimization error in terms 
of the error due to each worker as \\(3* — /3||2 = 

\/X)fcLi ll/ 3 * — 3felli < VK^m, which holds 
with probability 1 —A' (6 + The final bound, ([8]) 

follows by setting p = yj c ° an d T subs > c\pK 

_ £ 

and noting that y/~K • , '-A < A-. □ 

Vk 

Theorem [T] guarantees that the solution to Dual- 
Loco will be close to the optimal solution obtained 
by a single worker with access to all of the data. Our 
result relies on the data having rank r <C p. In prac¬ 
tice, this assumption is often fulfilled, in particular 
when the data is high dimensional. For a large enough 
projection dimension, the bound has only a weak de¬ 
pendence on K through the union bound used to de¬ 
termine f. The error is then mainly determined by 
the ratio between the rank and the random projec¬ 
tion dimension. When the rank of X increases for a 
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fixed p , we need a larger projection dimension to ac¬ 
curately capture its spectrum. On the other hand, the 
failure probability increases with p and decreases with 
r. However, this countering effect is negligible as typ¬ 
ically log (p — t) <C r. 

5 Implementation and Experiments 


In this section we report on the empirical performance 
of Dual-Loco in two sets of experiments. The first 
demonstrates the performance of Dual-Loco in a 
large, distributed classification task. The second is 
an application of £2 penalized regression to a problem 
in climate science where accurate recovery of the coef¬ 
ficient estimates is of primary interest. 


Cross validation. In most practical cases, the regu¬ 
larization parameter A is unknown and has to be de¬ 
termined via u-fold cross validation (CV). The chosen 
algorithm is usually run entirely once for each fold and 
each of l values of A, leading to a runtime that is ap¬ 
proximately v ■ l as large as the runtime of a single 
ru 43 In this context, Dual-Loco has the advantage 
that steps 3 and 4 in Algorithm [l] are independent of 
A. Therefore, these steps only need to be performed 
once per fold. In step 5, we then estimate &k for each 
value in the provided sequence for A. Thus, the run¬ 
time of Dual-Loco will increase by much less than 
v ■ l compared to the runtime of a single run. The per¬ 
formance of each value for A is then not only averaged 
over the random split of the training data set into v 
parts but also over the randomness introduced by the 
random projections which are computed and commu¬ 
nicated once per fold. The procedure is provided in 
full detail in Algorithm [2] in Appendix [Cj 

Implementation details. We implemented Dual- 
Loco using the Apache Spark framework^] Spark 
is increasingly gaining traction in the research com¬ 
munity as well as in industry due to its easy-to-use 
high-level API and the benefits of in-memory process¬ 
ing. Spark is up to 100 x faster than Hadoop MapRe¬ 
duce. Additionally, Spark can be used in many differ¬ 
ent large-scale computing environments and the vari¬ 
ous, easily-integrated libraries for a diverse set of tasks 
greatly facilitate the development of applications. 


When communicating and summing the random fea¬ 
tures in Spark, Dual-Loco leverages the treeReduce 
scheme as illustrated in Figure S JbJ Summing has 
the advantage that increasing the number of work¬ 
ers simply introduces more layers in the tree struc¬ 
ture (Figure [2b]) while the load on the driver remains 


3 “Approximately” since the cross validation procedure 
also requires time for testing. For a single run we only 
count the time it takes to estimate the parameters. 

4 A software package will be made available under the 
Apache license. 



(a) (b) 

Figure 2: Schematic for the aggregation of the random 
features in Spark. | (a) | When concatenating the random 
features naively, every worker node (exec.) sends its ran¬ 
dom features to the driver from where they are broadcasted 
to all workers. |(b)| Using the treeReduce scheme we can 
reduce the load on the driver by summing the random fea¬ 
tures from each worker node as this operation is associative 
and commutative. Worker k is only required to subtract 
its own random features locally. 


constant and the aggregation operation also benefits 
from a parallel execution. Thus, when increasing K 
only relatively little additional communication cost is 
introduced which leads to speedups as demonstrated 
below. 


In practice, we used the discrete cosine transform 
(DOT) provided in the FFT library j Transforms in and 


we ran Dual-Loco as well as CoCoA + on a high- 
performance clustei[3 


Competing methods. For the classification exam¬ 
ple, the loss function is the hinge loss. Although 
the problem is non-smooth, and therefore not covered 
by our theory, we still obtain good results suggest¬ 
ing that Theorem [l] can be generalized to non-smooth 
losses. Alternatively, for classification the smoothed 
hinge or logistic losses could be used. For the regres¬ 
sion problem we use the squared error loss and modify 
CoCoA+ accordingly. As the LocalDualSolver we 
use SDCA [Iff}. 


We also compared Dual-Loco against the reference 
implementation of distributed loss minimization in the 
MLlib library in Spark using SGD and L-BFGS. How¬ 
ever, after extensive cross-validation over regulariza¬ 
tion strength (and step size and mini-batch size in case 
of SGD), we observed that the variance was still very 
large and so we omit the MLlib implementations from 
the figures. A comparison between C 0 C 0 A and vari¬ 
ants of SGD and mini-batch SDCA can be found in 
0 - 


“https://sites.google.com/site/piotrwendykier/ 
sof tware/j transf orms 

“For the Hadamard transform, r must be a power of 
two. For the DCT there is no restriction on r and very 
similar theoretical guarantees hold. 

' CoCoA + is also implemented in Spark with code avail¬ 
able from https://github.com/gingsmith/cocoa 
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Kaggle Dogs vs Cats dataset. This is a binary 
classification task consisting of 25,000 images of dogs 
and cat^] We resize all images to 430 x 430 pixels 
and use Overfeat [T9] - a pre-trained convolutional 
neural network - to extract p = 200, 704 fully dense 
feature vectors from the 19 th layer of the network for 
each image. We train on ntrain = 20, 000 images and 
test on the remaining nt es t = 5, 000. The size of the 
training data is 37GB with over 4 billion non-zero el¬ 
ements. All results we report in the following are av¬ 
eraged over five repetitions and by “runtime” we refer 
to wall clock time. 


0.05 


<0.02 



Algorithm 

• DUAL-LOCO 0.5 

• DUAL-LOCO 1 

• DUAL-LOCO 2 

• COCOA+ 


No. of Workers 


A 24 
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+ 96 
B 192 


o.oo- 

0.00 0.01 0.02 0.03 0.04 0.05 
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Figure [3] shows the median normalized training and 
test prediction MSE of Dual-Loco and CoCoA + for 
different numbers of worker^ For Dual-Loco, we 
also vary the size of the random feature representation 
and choose T su b s = {0.005,0.01,0.02} x (p — r). The 
corresponding errors are labeled with Dual-Loco 0.5, 
Dual-Loco 1 and Dual-Loco 2. Note that combi¬ 
nations of K and t su b s that would result in r < T su b s 
cannot be used, e.g. K = 192 and T su b s = 0.01 x (jp—r). 
We ran CoCoA + until a duality gap of 10~ 2 was at¬ 
tained so that the number of iterations varies for differ¬ 
ent numbers of worker^] Notably, for K = 48 more 
iterations were needed than in the other cases which 
is reflected in the very low training error in this case. 
The fraction of local points to be processed per round 
was set to 10%. We determined the regularization pa¬ 
rameter A via 5-fold cross validation. 

While the differences in training errors between Dual- 
LOCO and CoCoA + are notable, the differences be¬ 
tween the test errors are minor as long as the ran¬ 
dom feature representation is large enough. Choosing 
Tsubs to be only 0.5% of p — t seems to be slightly too 
small for this data set. When setting T su b s to be 1% 
of p — t the largest difference between the test errors 
of Dual-Loco and CoCoA + is 0.9%. The averaged 
mean squared prediction errors and their standard de¬ 
viations are collected in Table [l] in Appendix [C] 

Next, we would like to compare the wall clock time 
needed to find the regularization parameter A via 5- 
fold cross validation. For CoCoA + , using the num¬ 
ber of iterations needed to attain a duality gap of 
10~ 2 would lead to runtimes of more than 24 hours 
for K £ {48,96,192} when comparing l = 20 possi¬ 
ble values for A. One might argue that using a du¬ 
ality gap of 10 _1 is sufficient for the cross validation 
runs which would speed up the model selection pro¬ 
cedure significantly as much fewer iterations would be 

h https://www.kaggle.com/c/dogs-vs-cats 

s In practice, this choice will depend on the available 
resources in addition to the size of the data set. 

10 For K ranging from 12 to 192, the number of iterations 
needed were 77, 207, 4338,1966, resp. 3199. 


Figure 3: Dogs vs Cats data: Median normalized training 
and test prediction MSE based on 5 repetitions. 


DUAL-LOCO 0.5 DUAL-LOC01 DUAL-LOCO 2 COCOA+ 



Figure 4: Total wall clock time including 5-fold CV over 
l — 20 values for A. For CoCoA + , we use a duality gap 
(DG) of 10 -1 for the CV runs when K > 48. 

required. Therefore, for K > 48 we use a duality gap 
of 10” 1 during cross validation and a duality gap of 
10~ 2 for learning the parameters, once A has been de¬ 
termined. Figure [4] shows the runtimes when l = 20 
possible values for A are compared; Figure [cj^a)] com¬ 
pares the runtimes when cross validation is performed 
over l = 50 values. The absolute runtime of CoCoA + 
for a single run is smaller for K = 12 and K = 24 
and larger for I\ £ {48, 96,192}, so using more work¬ 
ers increased the amount of wall clock time necessary 
for job completion. The total runtime, including cross 
validation and a single run to learn the parameters 
with the determined value for A, is always smaller for 
Dual-Loco, except when K = 12 and l = 20. 

Figures [5|and |(|[b)] show the relative speedup of Dual- 
Loco and CoCoA + when increasing K. The speedup 
is computed by dividing the runtime for K = 12 
by the runtime achieved for the corresponding K = 
{24,48,96,192}. A speedup value smaller than 1 im¬ 
plies an increase in runtime. When considering a sin¬ 
gle run, we run CoCoA + in two different settings: (i) 
We use the number of iterations that are needed to 
obtain a duality gap of 10 -2 which varies for differ¬ 
ent number of workersP^. Here, the speedup is smaller 
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Figure 5: Relative speedup for (a) a single run and (b) 
5-fold CV over l = 20 values for 
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Figure 6: 5-fold C V ov er l = 50 values for A: (a) 
wall clock time and (b) relative speedup. 


Total 


than 1 for all K. (ii) We fix the number of outer it¬ 
erations to a constant number. As K increases, the 
number of inner iterations decreases, making it easier 
for CoCoA + to achieve a speedup. We found that 
although CoCoA + attains a speedup of 1.17 when in¬ 
creasing K from 12 to 48 (equivalent to a decrease in 
runtime of 14%), CoCoA + suffers a 24% increase in 
runtime when increasing K from 12 to 192. 

For Dual-Loco 0.5 and Dual-Loco 1 we observe 
significant speedups as K increases. As we split the 
design matrix by features the number of observations 
n remains constant for different number of workers. 
At the same time, the dimensionality of each worker’s 
local problem decreases with K. Together with the 
efficient aggregation of the random features, this leads 
to shorter runtimes. In case of Dual-Loco 2, the 
communication costs dominate the costs of computing 
the random projection and of the LocalDualSolver, 
resulting in much smaller speedups. 

Although CoCoA + was demonstrated to obtain 
speedups for low-dimensional data sets [2j it is plausi¬ 
ble that the same performance cannot be expected on 
a very high-dimensional data set. This illustrates that 
in such a high-dimensional setting splitting the design 
matrix according to the columns instead of the rows is 
more suitable. 

Climate data. This is a regression task where we 
demonstrate that the coefficients returned by Dual- 
Loco are interpretable. The data set contains the out¬ 
come of control simulations of the GISS global circula¬ 
tion model (201 and is part of the CMIP5 climate mod¬ 
eling ensemble. We aim to forecast the monthly global 
average temperature Y in February using the air pres¬ 
sure measured in January. Results are very similar for 
other months. The p = 10,368 features are pressure 
measurements taken at 10, 368 geographic grid points 


in January. The time span of the climate simulation 
is 531 years and we use the results from two control 
simulations, yielding rctrain = 849 and nt es t = 213. 


In Figure [7] we compare the coefficient estimates for 
four different methods. The problem is small enough 
to be solved on a single machine so that the full solu¬ 
tion can be computed (using SDCA; cf. Figure [^pijj ) . 
This allows us to report the normalized parameter es¬ 
timation mean squared error (MSEg) with respect to 
the full solution in addition to the normalized mean 
squared prediction error (MSE). The solution indi¬ 
cates that the pressure differential between Siberia 
(red area, top middle-left) and Europe and the North 
Atlantic (blue area, top left and top right) is a good 
predictor for the temperature anomaly. This pattern is 
concealed in Figure |i|[b)| which shows the result of up- 
projecting the coefficients estimated following a ran¬ 
dom projection of the columns. Using this scheme for 
prediction was introduced in [TS]. Although the MSE 
is similar to the optimal solution, the recovered coeffi¬ 
cients are not interpretable as suggested by [16] . Thus, 
this method should only be used if prediction is the sole 
interest. Figure ijc) shows the estimates returned by 
Dual-Loco which is able to recover estimates which 
are close to the full solution. Finally, Figure [^[d)| shows 
that CoCoA + also attains accurate results. 


Considering a longer time period or adding additional 
model variables such as temperature, precipitation or 
salinity rapidly increases the dimensionality of the 
problem while the number of observations remains 
constant. Each additional variable adds 10,368 di¬ 
mensions per month of simulation. Estimating very 
high-dimensional linear models is a significant chal¬ 
lenge in climate science and one where distributing 
the problem across features instead of observations is 
advantageous. The computational savings are much 
larger when distributing across features asp>rt and 
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(a) Single machine: Full solution (MSE = 0.72) 



(c) Distributed setting: Dual-Loco 10 with K — 4 
(MSE = 0.72, MSE 3 = 0.02) 



- 0.2 

- 0.0 
- - 0.2 



(b) Single machine: Column-wise compression 
(MSE = 0.73, MSE 3 = 21.28) 



- 0.00 

- - 0.05 



(d) Distributed setting: CoCoA + with K = 4 
(MSE = 0.72, MSE 3 = 0.01) 


Figure 7: Climate data: The regression coefficients are shown as maps with the prime median (passing through London) 
corresponding to the left and right edge of the plot. The Pacific Ocean lies in the center of each map. 


thus reducing p is associated with larger gains than 
when distributing across observations. 

6 Conclusions and further work 

We have presented Dual-Loco which considers the 
challenging and rarely studied problem of statistical 
estimation when data is distributed across features 
rather than samples. Dual-Loco generalizes LOCO to 
a wider variety of loss functions for regression and clas¬ 
sification. We show that the estimated coefficients are 
close to the optimal coefficients that could be learned 
by a single worker with access to the entire dataset. 
The resulting bound is more intuitive and tighter than 
previous bounds, notably with a very weak depen¬ 
dence on the number of workers. We have demon¬ 
strated that Dual-Loco is able to recover accurate 
solutions for large-scale estimation tasks whilst also 
achieving better scaling than a state-of-the-art com¬ 
petitor, CoCoA + , as K increases. Additionally, we 
have shown that Dual-Loco allows for fast model se¬ 
lection using cross-validation. 

The dual formulation is convenient for £ 2 penalized 
problems but other penalties are not as straightfor¬ 
ward. Similarly, the theory only holds for smooth 
loss functions. However, as demonstrated empirically 
Dual-Loco also performs well with a non-smooth loss 
function. 


As n grows very large, the random feature matrices 
may become too large to communicate efficiently even 
when the projection dimension is very small. For these 
situations, there are a few simple extensions we aim to 
explore in future work. One possibility is to first per¬ 
form row-wise random projections (c.f. 1211 ) to further 
reduce the communication requirement. Another op¬ 
tion is to distribute X according to rows and columns. 

Contrary to stochastic optimization methods, the com¬ 
munication of Dual-Loco is limited to a single round. 
For fixed n, p and T su b s , the amount of communication 
is deterministic and can be fixed ahead of time. This 
can be beneficial in settings where there are additional 
constraints on communication (for example when dif¬ 
ferent blocks of features are distributed a priori across 
different physical locations). 

Clearly with additional communication, the theoret¬ 
ical and practical performance of Dual-Loco could 
be improved. For example, uni suggest an iterative 
dual random projection scheme which can reduce the 
error in Lemma [4] exponentially. A related question 
for future research involves quantifying the amount of 
communication performed by Dual-Loco in terms of 
known minimax lower bounds j3j. 
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Supplementary Information for 
Dual-Loco: Distributing 

Statistical Estimation Using 
Random Projections 


Lemma 3 (Adapted from Lemma 1 [17]). Let at* and 
a be as defined in Definition^ 7] We obtain 

^(a-a*) T (K-K fc )a* > i(d-a*) T Kfc(a-a*). 

( 11 ) 


A Supplementary Results 


Proof. See [T71 . 


□ 


Here we introduce two lemmas. The first describes 
the random projection construction which we use in 
the distributed setting. 

Lemma 2 (Summing random features). Consider 
the singular value decomposition X = UEV T where 
U £ R nxr and V £ R pxr have orthonormal columns 
and X £ K rxr is diagonal; r = rankfiK). Cq is a 
fixed positive constant. In addition to the raw features, 
let Xfc £ R Tlx (' r + r »“'>„) contain random features which 
result from summing the K — 1 random projections 
from the other workers. Furthermore, assume without 
loss of generality that the problem is permuted so that 
the raw features of worker k’s problem are the first r 
columns of~X. andX k . Finally, let 


Os 


It 

0 


0 

n 


£ KP x ( T + r =*‘ ! ») 


such that Xfc = X© 5 . 

With probability at least 1 — (5 + ^f~) 


||V t ©s ©5 V 


V 1 VI 


2 < 


' Co log (2 r/S)r 

1~subs 


For our main result, we rely heavily on the following 
variant of Theorem 1 in m which bounds the differ¬ 
ence between the coefficients estimated by worker k, 
f3 k and the corresponding coordinates of the optimal 
solution vector (3* k . 

Lemma 4 (Local optimization error. Adapted from 
[T7]). For p = yj c ° l °sC^_ /syr_ t] le following holds 

\\K-vih< 

with probability at least 1 — (<5 + 2 (A) • 

The proof closely follows the proof of Theorem 1 in m 
which we restate here identifying the major differences. 

Proof. Let the quantities D k (ot), Kfc, be as in Defini- 
tion[l] For ease of notation, we shall omit the subscript 
k in D k (a) and Kfc in the following. 

By the SVD we have X = USV T . So K = USSU T 
and K = USV T nn T VSU T . We can make the fol¬ 
lowing definitions 

7 * = XU T a*, 7 = SU T a. 


Proof. See Appendix |B| □ 

Definition 1. For ease of exposition, we shall rewrite 
the dual problems so that we consider minimizing con¬ 
vex objective functions. More formally, the original 
problem is then given by 

a* = argmin i D(a) := V f* (af) + r\a T Ka 1 . 
a€ R n [ 2nA j 

(9) 

The problem worker k solves is described by 


Defining M = V T im T V and plugging these into 
Lemma [3] we obtain 

(7 - 7*) T (I - M) 7 * > (7 - 7*) T M(7 - 7*)- (12) 

We now bound the spectral norm of I — M using 
Lemma [2] Recall that Lemma [2] bounds the differ¬ 
ence between a matrix and its approximation by a dis¬ 
tributed dimensionality reduction using the SRHT. 

Using the Cauchy-Schwarz inequality we have for the 
l.h.s. of @ 


at = argmin i D k (ct) := V' f*(a t ) + — a T K k at i . 

agi" [ 2 nA | 

( 10 ) 

Recall that K/. = X^-Xj, where Xfc is the concatena¬ 
tion of the t raw features and T su b s random features 
for worker k. 

To proceed we need the following result which relates 
the solution of the original problem to that of the ap¬ 
proximate problem solved by worker k. 


(7 - 7*) T (l - M) 7 * < p||7*|| 2 ||7 ~ 7*II 2 


For the r.h.s. of (12), we can write 


(7-7*) t M(7- 7 *) 

= ll 7 - 7 *ll 2 -( 7 - 7 *) T (i-M) (7 
>II7-7*II2-Pll7-712 
= (1-P)||7 -7*111- 


7*) 
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Combining these two expressions and inequality (12) 
yields 


(1 - P)\\i ~ 7*11 2 < Pll7*l| 2 ||7 - 7*11 2 
(1-P)ll7-7*I| 2 <P||7*I| 2 - (13) 


Then 


(1 -5)l-K 


< <r P (SW) 

and 


From the definition of 7 * and 7 above and (3* and (3, 
respectively we have 


r = -^ v 7*, 


/3 = - x v 7 


so ill7*l| 2 = Wh and ll/3-/3*l| 2 = xll7-7*l| 2 due 
to the orthonormality of V. Plugging this into (13) 

and using the fact that \\(3* — (3\\ 2 > ||/3£ — /3 fc || 2 we 
obtain the stated result. □ 


B Proof of Row Summing Lemma 

Proof of Lemma [1| . Let ~V k contain the first r rows of 
V and let ~V be the matrix containing the remain¬ 
ing rows. Decompose the matrix products as follows 

V T V = VjV k + Vj_ k) V ( _ k) 

and 

v T 050jv = v^v fc + v A Tv fc 

with Vj = VJ fc) II. Then 

||V T 0 s 0jV-V t V || 2 
= ||V^V fc + vjv fc - vjv k - vj_ k) v { _ k) \\ 2 
= ||v ( r _ fe) nn T v ( _ fc) - v ( r _ fc) v ( _ fc) || 2 . 

Since 0s is an orthogonal matrix, from Lemma 3.3 
in [[15] and Lemma [ 5 ] summing (K — 1) independent 
SRHTs from r to t su i, s is equivalent to applying a sin¬ 
gle SRHT from p—r to T su b s - Therefore we can simply 
apply Lemma 1 of m to the above to obtain the re¬ 
sult. □ 

Lemma 5 (Summed row sampling). Let W be an nxp 
matrix with orthonormal columns. Let Wi,...,W;c 
be a balanced, random partitioning of the rows of W 
where each matrix W k has exactly t = n/K rows. 
Define the quantity M := n ■ max.j = i r _ n 11 W 11 § - For 
a positive parameter a, select the subsample size 

l ■ K > aMlog(p). 


ffi(SW) < 


(1 + rj)l ■ K 


with failure probability at most 


P■ 


\ e ~ s 1 

ot logp 


( 1 -y - 5 

1 p ' 

(i + v y+v_ 


ot logp 


Proof. Define w J as the j th row of W and M := n ■ 
max, 11 wj-11§ - Suppose K = 2 and consider the matrix 

G 2 : = (SjWi + S 2 W 2 ) t (S 1 W 1 + S 2 W 2 ) 

= (S 1 W 1 ) T (S 1 W 1 ) + (S 2 W 2 ) t (S 2 W 2 ) 

+ (S 1 W 1 ) t (S 2 W 2 ) + (S 2 W 2 ) t (S 1 W 1 ). 

In general, we can express G := (SW) T (SW) as 

G:= EE ( w j w I + E E w t w T 

k=lj£T k y k'^kj'eTJ. 



By the orthonormality of W, the cross terms cancel as 
WjWy = 0, yielding 

K 

G := (SW) T (SW) = ^ ^ wjwT. 

k =1 j£T k 


We can consider G as a sum of l ■ K random matrices 


v(l) -v(K) ^( 1 ) „(K) 

-*-1 ) • • • 5 A-l ! • • • I A 1 > • • • ! 


sampled uniformly at random without replacement 
from the family X := (w^w^ : i = 1,..., r • A'}. 

To use the matrix Chernoff bound in Lemma [ 6 ] we 
require the quantities /z m i n , /i max and B. Noticing that 
Amax(wjwJ) = IIWj||| < we can set B < M/n. 

Taking expectations with respect to the random par¬ 
titioning (Ep) and the subsampling within each par¬ 
tition (Eg), using the fact that columns of W are or¬ 
thonormal we obtain 


Let Sp fe € M. lXT denote the operation of uniformly 
at random sampling a subset, T k of the rows of W& 
by sampling l coordinates from { 1 , 2 ,... r} without re¬ 
placement. Now denote SW as the sum of the sub¬ 
sampled rows 

K 

SW = £ (S-zyWfc) . 

fc= 1 


E 


X 


(k) 


EpEpX^ 


1 1 
Kt 


^ 1 1 1 

y Wjw^ = —w T w = —i 

' n. 71. 


Recall that we take l samples in K blocks so we can 
define 


_l ■ K 

Mmin — 

Tl 


and 


1 ■ K 

/^max — 

Tl 
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Plugging these values into Lemma [6j the lower and 
upper Chernoff bounds respectively yield 


Amin (G) < (1 -8) 


IK 


<P- 


5 


( 1 - 6 ) 


1 -s 


n 
l-K/M 


for 8 £ [0,1), and 


Amax (G) > (1 + <5) 


<P- 


IK' 


Algorithm 


K 


TEST MSE 


TRAIN MSE 


(l + «) 


1+5 


Dual-Loco 0.5 

12 

0.0343 (3.75e-03) 

0.0344 (2.59e-03) 

Dual-Loco 0.5 

24 

0.0368 (4.22e-03) 

0.0344 (3.05e-03) 

Dual-Loco 0.5 

48 

0.0328 (3.97e-03) 

0.0332 (2.91e-03) 

Dual-Loco 0.5 

96 

0.0326 (3.13e-03) 

0.0340 (2.67e-03) 

Dual-Loco 0.5 

192 

0.0345 (3.82e-03) 

0.0345 (2.69e-03) 

Dual-Loco 1 

12 

0.0310 (2.89e-03) 

0.0295 (2.28e-03) 

Dual-Loco 1 

24 

0.0303 (2.87e-03) 

0.0307 (l-44e-03) 

Dual-Loco 1 

48 

0.0328 (1.92e-03) 

0.0329 (1.55e-03) 

Dual-Loco 1 

96 

0.0299 (l-07e-03) 

0.0299 (7.77e-04) 

Dual-Loco 2 

12 

0.0291 (2.16e-03) 

0.0280 (6.80e-04) 

Dual-Loco 2 

24 

0.0306 (2.38e-03) 

0.0279 (1.24e-03) 

Dual-Loco 2 

48 

0.0285 (6.11e-04) 

0.0293 (4.77e-04) 


n ) 

CoCoA+ 

12 

0.0282 (4.25e-18) 

0.0246 (2.45e-18) 

l-K/M 

CoCoA+ 

24 

0.0278 (3.47e-18) 

0.0212 (3.00e-18) 

for 8 > 0. 

CoCoA+ 

48 

0.0246 (6.01e-18) 

0.0011 (1.53e-19) 


CoCoA+ 

96 

0.0254 (5.49e-18) 

0.0137 (1.50e-18) 


CoCoA+ 

192 

0.0268 (1.23e-17) 

0.0158 (6.21e-18) 


Noting that A m i n (G) = <r p (G) 2 , similarly for A max and 
using the identity for G above obtains the desired re¬ 
sult. □ 


Table 1: Dogs vs Cats data: Normalized training and test 
MSE: mean and standard deviations (based on 5 repeti¬ 
tions). 


For ease of reference, we also restate the Matrix Cher¬ 
noff bound from C3123] but defer its proof to the 
original papers. 

Lemma 6 (Matrix Chernoff from |T3]). Let X be a 
finite set of positive-semidefinite matrices with dimen¬ 
sion p, and suppose that 


max A max 
AeA 


(A) < B 


Sample {Ai,...,A/} uniformly at random from X 
without replacement. Compute 


/Imin — ^-A m in(EXi) and /Tmax — l ' A max (EX| ) 


Then 


Amin ' A ij < (1 S)p n 

_(5 "j Mmin / B 


<P' 


L(i-5) 


1-5 


for S £ [0,1), and 


^max l? A v > (1 + 8)fi 

n 

§ ”| Mm ax / B 


<P' 


(1 + t>) 1+l5 _ 


for 8 > 0. 


Supplementary Material for 
Section [5] 


Algorithm 2 Dual-Loco - cross validation 


Input: Data: X, Y. no. workers: K , no. folds: v 
Parameters: T su b s , Ai,... A; 

1 : Partition {p} into K subsets of equal size r and distribute 
feature vectors in X accordingly over K workers. 
Partition {n} into v folds of equal size, 
for each fold / do 

Communicate indices of training and test points, 
for each worker k £ {1, ... K} in parallel do 

Compute and send X t l fj ln Tlhj. 


Receive random features and construct X)(™ m . 
for each A_, £ {Ai,... A;} do 

3 , 


<- LocalDualSolver(Xf? m , Yj rain , Xj) 


J kj,\ 

Y ’test 

i 


_ 1 v train ' pc 

- a kJ,\j 


_ vtestfD 

~ kj Pk 


Send 


9 
10 
11 
12 

13 

14 

15 

16 

17 

18 

19 

20 
21 
22 : 

Output: Parameter A j attaining smallest MSEa 


to driver. 

n,,j ,/a 7 

end for 
end for 

for each A j £ {,... A;} do 

Compute Y}% = Eti Kj!x r 
Compute MSE fg. with Yffi* and Yj est . 

end for 
end for 

for each Xj £ {Ai,... A;} do 

Compute MSEa 3 = A i m SE/ jAj . . 

end for 






























